Stat Comp Assignment 1

John Wilshire, Jono Chan :)

17/08/2017

English Premier League

Data set

We found a “Huge” soccer database:

Data set

That was scraped from:

The dataset

Size:

Total dataset:

Games from the 2008/2009 season up to the 2015/2016 season

EPL

We took a subset of this, the EPL:

* ~3k matches
* 1.3k EPL players
* 35 teams (teams are promoted into and relegated out of the EPL)

Total: 380 games per season

##                 [,1]                 
## number_of_games "3040"               
## min(date)       "2008-08-16 00:00:00"
## max(date)       "2016-05-17 00:00:00"
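
The game count above is consistent with the league format. A quick arithmetic check, assuming the standard 20-team double round-robin (the number of teams per season is not stated in these slides):

```r
# Assumption: 20 teams per season, each playing every other team home and away.
teams_per_season <- 20
games_per_season <- teams_per_season * (teams_per_season - 1)

# Seasons 2008/2009 through 2015/2016 inclusive.
seasons <- 8
total_games <- games_per_season * seasons

c(games_per_season = games_per_season, total_games = total_games)
```

Under that assumption the per-season and total counts match the 380 and 3040 reported above.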

Exploratory Data analysis

Exploratory Analysis

Distribution of scores

Soccer is very low scoring

Data manipulation

how the team has done in previous games this season
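
The feature code is not shown in these slides. A minimal sketch of one way to build such a "form so far" feature, assuming it is the running goal margin over a team's earlier games in the same season (which is what the `cumulative_margin_*` predictor names in the models below suggest). The toy table and its column names are hypothetical:

```r
library(dplyr)

# Toy long-format table: one row per team per game (hypothetical columns).
games <- tibble(
  season = "2008/2009",
  team = c("A", "A", "A", "B", "B", "B"),
  date = as.Date(c("2008-08-16", "2008-08-23", "2008-08-30",
                   "2008-08-16", "2008-08-23", "2008-08-30")),
  goals_for = c(2, 0, 3, 1, 1, 0),
  goals_against = c(1, 1, 0, 1, 2, 2)
)

cumulative <- games %>%
  arrange(season, team, date) %>%
  group_by(season, team) %>%
  mutate(
    margin = goals_for - goals_against,
    # Sum of margins from earlier games only, so the current result
    # does not leak into its own predictor.
    cumulative_margin = cumsum(margin) - margin
  ) %>%
  ungroup()

cumulative
```

Subtracting the current game's margin from the running sum keeps the feature strictly pre-match, which matters if it is later used to predict that same match.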

Team ratings

For each team (home, away) we have their 11-player lineup. We can join this with the player statistics table and, using the closest assessment before the game, aggregate the player scores into a measure of how good we think a team is.
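
The join described above could look roughly like the following. This is a sketch on toy data, not the code used for the slides; the table layouts and column names (`lineups`, `player_stats`, `assessed_on`, and so on) are assumptions:

```r
library(dplyr)

# Hypothetical toy tables; the real schema may differ.
lineups <- tibble(
  match_id = c(1, 1, 1, 1),
  side = c("home", "home", "away", "away"),
  player_id = c(10, 11, 20, 21),
  match_date = as.Date("2010-01-10")
)
player_stats <- tibble(
  player_id = c(10, 10, 11, 20, 21, 21),
  assessed_on = as.Date(c("2009-06-01", "2009-12-01", "2009-08-01",
                          "2009-09-01", "2009-07-01", "2010-03-01")),
  overall_rating = c(70, 74, 80, 65, 60, 90)
)

team_ratings <- lineups %>%
  inner_join(player_stats, by = "player_id") %>%
  # Keep only assessments made before the game.
  filter(assessed_on < match_date) %>%
  group_by(match_id, side, player_id) %>%
  # Of those, take the most recent one per player.
  slice_max(assessed_on, n = 1, with_ties = FALSE) %>%
  group_by(match_id, side) %>%
  summarise(overall_rating_mean = mean(overall_rating), .groups = "drop")

team_ratings
```

Filtering on `assessed_on < match_date` before picking the latest row ensures a rating published after the match (player 21's second assessment here) is never used.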

Correlation plots

Final cleaning before modelling

Modeling

Home team wins glm (model 1)

Model on everything

glm(home_win ~ . , family = binomial(), data = epl4 %>% select(-outcome, -matches('goal|outcome'))) -> home_full_glm
summary(home_full_glm)
## 
## Call:
## glm(formula = home_win ~ ., family = binomial(), data = epl4 %>% 
##     select(-outcome, -matches("goal|outcome")))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1272  -0.9972  -0.5865   1.0674   2.4146  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -0.836286   2.579394  -0.324 0.745773    
## cumulative_margin_home        0.014607   0.003996   3.655 0.000257 ***
## cumulative_margin_away       -0.012342   0.003999  -3.086 0.002026 ** 
## overall_rating_mean_home     -0.024619   0.056367  -0.437 0.662278    
## potential_mean_home           0.042015   0.040867   1.028 0.303907    
## crossing_mean_home            0.001653   0.017336   0.095 0.924050    
## finishing_mean_home          -0.008298   0.018770  -0.442 0.658431    
## heading_accuracy_mean_home    0.003069   0.020429   0.150 0.880589    
## short_passing_mean_home       0.007477   0.031274   0.239 0.811043    
## volleys_mean_home             0.031482   0.015981   1.970 0.048850 *  
## dribbling_mean_home           0.003786   0.022912   0.165 0.868762    
## curve_mean_home              -0.008841   0.016549  -0.534 0.593189    
## free_kick_accuracy_mean_home  0.035521   0.014564   2.439 0.014727 *  
## long_passing_mean_home        0.011767   0.025291   0.465 0.641737    
## ball_control_mean_home       -0.021327   0.039116  -0.545 0.585604    
## acceleration_mean_home       -0.009251   0.029511  -0.313 0.753910    
## sprint_speed_mean_home        0.001683   0.028344   0.059 0.952647    
## agility_mean_home            -0.006141   0.020909  -0.294 0.768998    
## reactions_mean_home           0.049184   0.029570   1.663 0.096253 .  
## balance_mean_home             0.003522   0.015782   0.223 0.823418    
## shot_power_mean_home         -0.015660   0.019497  -0.803 0.421881    
## jumping_mean_home             0.030277   0.016578   1.826 0.067793 .  
## stamina_mean_home            -0.005951   0.019266  -0.309 0.757396    
## strength_mean_home            0.017964   0.021563   0.833 0.404801    
## long_shots_mean_home         -0.023499   0.020291  -1.158 0.246821    
## aggression_mean_home         -0.007986   0.015100  -0.529 0.596878    
## interceptions_mean_home       0.004172   0.018856   0.221 0.824893    
## positioning_mean_home        -0.005696   0.019206  -0.297 0.766789    
## vision_mean_home              0.022139   0.019647   1.127 0.259820    
## penalties_mean_home           0.015169   0.014267   1.063 0.287681    
## marking_mean_home            -0.015385   0.023646  -0.651 0.515275    
## standing_tackle_mean_home     0.011226   0.029581   0.380 0.704307    
## sliding_tackle_mean_home     -0.037010   0.023222  -1.594 0.110990    
## gk_diving_mean_home           0.050944   0.039854   1.278 0.201156    
## gk_handling_mean_home         0.076229   0.046216   1.649 0.099067 .  
## gk_kicking_mean_home          0.009853   0.016814   0.586 0.557900    
## gk_positioning_mean_home     -0.034341   0.046920  -0.732 0.464222    
## gk_reflexes_mean_home        -0.084819   0.046099  -1.840 0.065778 .  
## overall_rating_mean_away     -0.029729   0.056201  -0.529 0.596817    
## potential_mean_away          -0.042012   0.039831  -1.055 0.291533    
## crossing_mean_away           -0.009620   0.016818  -0.572 0.567305    
## finishing_mean_away           0.001267   0.018399   0.069 0.945119    
## heading_accuracy_mean_away   -0.014740   0.019712  -0.748 0.454609    
## short_passing_mean_away      -0.028182   0.030992  -0.909 0.363179    
## volleys_mean_away             0.014514   0.015491   0.937 0.348805    
## dribbling_mean_away           0.039078   0.022972   1.701 0.088920 .  
## curve_mean_away               0.013912   0.016340   0.851 0.394541    
## free_kick_accuracy_mean_away -0.011804   0.014059  -0.840 0.401138    
## long_passing_mean_away       -0.004881   0.024790  -0.197 0.843920    
## ball_control_mean_away       -0.045529   0.039363  -1.157 0.247423    
## acceleration_mean_away        0.035373   0.029210   1.211 0.225888    
## sprint_speed_mean_away       -0.059708   0.028065  -2.128 0.033377 *  
## agility_mean_away             0.020644   0.020371   1.013 0.310853    
## reactions_mean_away           0.013291   0.028801   0.461 0.644447    
## balance_mean_away            -0.021616   0.015693  -1.377 0.168387    
## shot_power_mean_away         -0.007514   0.019397  -0.387 0.698481    
## jumping_mean_away             0.019873   0.016817   1.182 0.237324    
## stamina_mean_away            -0.003418   0.019262  -0.177 0.859174    
## strength_mean_away            0.041473   0.021206   1.956 0.050501 .  
## long_shots_mean_away         -0.008512   0.020198  -0.421 0.673461    
## aggression_mean_away          0.006474   0.014975   0.432 0.665536    
## interceptions_mean_away      -0.020670   0.018659  -1.108 0.267959    
## positioning_mean_away         0.003029   0.018353   0.165 0.868900    
## vision_mean_away              0.009772   0.019392   0.504 0.614330    
## penalties_mean_away          -0.010008   0.013918  -0.719 0.472074    
## marking_mean_away             0.008284   0.023418   0.354 0.723517    
## standing_tackle_mean_away     0.018267   0.029391   0.622 0.534264    
## sliding_tackle_mean_away     -0.016159   0.023148  -0.698 0.485139    
## gk_diving_mean_away          -0.001992   0.038098  -0.052 0.958308    
## gk_handling_mean_away        -0.035461   0.045440  -0.780 0.435156    
## gk_kicking_mean_away          0.015038   0.016713   0.900 0.368237    
## gk_positioning_mean_away      0.001995   0.046934   0.043 0.966097    
## gk_reflexes_mean_away        -0.005658   0.045912  -0.123 0.901917    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4192.1  on 3039  degrees of freedom
## Residual deviance: 3695.4  on 2967  degrees of freedom
## AIC: 3841.4
## 
## Number of Fisher Scoring iterations: 4
AIC(home_full_glm)
## [1] 3841.443
cat('full model, 72 predictors')
## full model, 72 predictors
caret::confusionMatrix(table(predict(home_full_glm, epl4) > 0, full$home_win))
## Confusion Matrix and Statistics
## 
##        
##         FALSE TRUE
##   FALSE  1214  586
##   TRUE    436  804
##                                           
##                Accuracy : 0.6638          
##                  95% CI : (0.6467, 0.6806)
##     No Information Rate : 0.5428          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3169          
##  Mcnemar's Test P-Value : 3.15e-06        
##                                           
##             Sensitivity : 0.7358          
##             Specificity : 0.5784          
##          Pos Pred Value : 0.6744          
##          Neg Pred Value : 0.6484          
##              Prevalence : 0.5428          
##          Detection Rate : 0.3993          
##    Detection Prevalence : 0.5921          
##       Balanced Accuracy : 0.6571          
##                                           
##        'Positive' Class : FALSE           
## 
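
A note on the `> 0` cutoff used above: `predict()` on a binomial glm returns log-odds by default, so thresholding at 0 is the same as thresholding the predicted probability at 0.5. A small sketch on simulated stand-in data (not the EPL data) illustrating this, plus a check of the reported accuracy from the table's diagonal:

```r
set.seed(1)

# Simulated stand-in data; variable names are illustrative only.
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.8 * x)) == 1
fit <- glm(y ~ x, family = binomial())

link_preds <- predict(fit) > 0                       # default type = "link" (log-odds)
prob_preds <- predict(fit, type = "response") > 0.5  # probability scale

# Accuracy is the share of observations on the confusion table's diagonal.
conf <- table(predicted = link_preds, actual = y)
sim_accuracy <- sum(diag(conf)) / sum(conf)

# Same calculation using the counts reported in the confusion matrix above.
reported_accuracy <- (1214 + 804) / 3040
```

Also note that caret reports `FALSE` as the positive class here, so the sensitivity figure above refers to correctly predicting games the home team did not win.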

Reduced model

glm(home_win ~ . , family = binomial(), data = full %>% select(-outcome, -matches('goal'))) -> home_glm
AIC(home_glm)
## [1] 3782.579
summary(home_glm)
## 
## Call:
## glm(formula = home_win ~ ., family = binomial(), data = full %>% 
##     select(-outcome, -matches("goal")))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1203  -1.0188  -0.6208   1.0809   2.2515  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -1.623131   1.591838  -1.020    0.308    
## cumulative_margin_home    0.016625   0.003808   4.366 1.27e-05 ***
## cumulative_margin_away   -0.015635   0.003807  -4.107 4.02e-05 ***
## overall_rating_mean_home  0.113161   0.014920   7.585 3.34e-14 ***
## overall_rating_mean_away -0.094396   0.014637  -6.449 1.13e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4192.1  on 3039  degrees of freedom
## Residual deviance: 3772.6  on 3035  degrees of freedom
## AIC: 3782.6
## 
## Number of Fisher Scoring iterations: 4
predict(home_glm, full) -> home_preds
cat('reduced model, 4 predictors')
## reduced model, 4 predictors
caret::confusionMatrix(table(home_preds > 0, full$home_win))
## Confusion Matrix and Statistics
## 
##        
##         FALSE TRUE
##   FALSE  1210  611
##   TRUE    440  779
##                                           
##                Accuracy : 0.6543          
##                  95% CI : (0.6371, 0.6712)
##     No Information Rate : 0.5428          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2966          
##  Mcnemar's Test P-Value : 1.573e-07       
##                                           
##             Sensitivity : 0.7333          
##             Specificity : 0.5604          
##          Pos Pred Value : 0.6645          
##          Neg Pred Value : 0.6390          
##              Prevalence : 0.5428          
##          Detection Rate : 0.3980          
##    Detection Prevalence : 0.5990          
##       Balanced Accuracy : 0.6469          
##                                           
##        'Positive' Class : FALSE           
## 
# Summary plot
par(mfrow = c(2,2))
plot(home_glm)
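
The reduced model has the lower AIC (3782.6 against 3841.4). As a complementary check, a likelihood-ratio comparison can be sketched directly from the deviances reported in the two summaries above. This assumes the two models are nested and were fit to the same 3040 rows, which the matching null deviances suggest but the slides do not state:

```r
# Residual deviances and degrees of freedom as reported above.
dev_full <- 3695.4
df_full <- 2967
dev_reduced <- 3772.6
df_reduced <- 3035

lr_stat <- dev_reduced - dev_full  # likelihood-ratio statistic
lr_df <- df_reduced - df_full      # extra predictors in the full model
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(statistic = lr_stat, df = lr_df, p_value = p_value)
```

With the fitted objects at hand, `anova(home_glm, home_full_glm, test = "Chisq")` performs the same comparison.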

Model 2

Multinomial logistic regression using the nnet package

## # weights:  18 (10 variable)
## initial  value 3339.781358 
## iter  10 value 2987.333352
## final  value 2986.500537 
## converged
## Confusion Matrix and Statistics
## 
##         
## mn_preds    A    D    H
##        A  438  237  255
##        D    0    2    0
##        H  429  544 1135
## 
## Overall Statistics
##                                          
##                Accuracy : 0.5181         
##                  95% CI : (0.5002, 0.536)
##     No Information Rate : 0.4572         
##     P-Value [Acc > NIR] : 1.025e-11      
##                                          
##                   Kappa : 0.1908         
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: A  Class: D Class: H
## Sensitivity            0.5052 0.0025543   0.8165
## Specificity            0.7736 1.0000000   0.4103
## Pos Pred Value         0.4710 1.0000000   0.5384
## Neg Pred Value         0.7967 0.7429230   0.7264
## Prevalence             0.2852 0.2575658   0.4572
## Detection Rate         0.1441 0.0006579   0.3734
## Detection Prevalence   0.3059 0.0006579   0.6934
## Balanced Accuracy      0.6394 0.5012771   0.6134
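
The fitting code for this model is not shown. The `# weights: 18 (10 variable)` line is consistent with three outcome classes and four predictors plus an intercept, since (3 - 1) x 5 = 10 free coefficients. A minimal sketch of such a fit with `nnet::multinom`, on simulated stand-in data with the same predictor names (the data-generating process is invented for illustration):

```r
library(nnet)

set.seed(1)

# Simulated stand-in data; not the EPL dataset.
n <- 600
sim <- data.frame(
  cumulative_margin_home = rnorm(n, 0, 15),
  cumulative_margin_away = rnorm(n, 0, 15),
  overall_rating_mean_home = rnorm(n, 75, 4),
  overall_rating_mean_away = rnorm(n, 75, 4)
)
strength <- 0.1 * (sim$overall_rating_mean_home - sim$overall_rating_mean_away)
noisy <- strength + rnorm(n)
sim$outcome <- factor(
  ifelse(noisy > 0.4, "H", ifelse(noisy < -0.4, "A", "D")),
  levels = c("A", "D", "H")
)

# Three-class logistic regression: away win, draw, home win.
mn_fit <- multinom(outcome ~ ., data = sim, trace = FALSE)

# predict() returns the most probable class for each game by default.
mn_preds <- predict(mn_fit, newdata = sim)
table(mn_preds, actual = sim$outcome)
```

Because the predicted class is simply the most probable one, a class that is rarely the single most likely outcome (draws, in the matrix above) is almost never predicted.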

Model 3

Predicting the scores (Poisson)

For both the home team and the away team

## 
## Call:
## glm(formula = home_team_goal ~ ., family = poisson(), data = full %>% 
##     select(-away_team_goal, -outcome, -home_win))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5051  -0.8850  -0.1596   0.5343   3.7227  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -1.051697   0.610270  -1.723 0.084829 .  
## cumulative_margin_home    0.004682   0.001302   3.595 0.000325 ***
## cumulative_margin_away   -0.006371   0.001398  -4.558 5.17e-06 ***
## overall_rating_mean_home  0.045638   0.005495   8.306  < 2e-16 ***
## overall_rating_mean_away -0.026601   0.005475  -4.859 1.18e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3789.7  on 3039  degrees of freedom
## Residual deviance: 3396.9  on 3035  degrees of freedom
## AIC: 9275.9
## 
## Number of Fisher Scoring iterations: 5
## 
## Call:
## glm(formula = away_team_goal ~ ., family = poisson(), data = full %>% 
##     select(-home_team_goal, -outcome, -home_win))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4469  -1.2986  -0.1242   0.5701   3.1087  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -1.019710   0.703846  -1.449  0.14740    
## cumulative_margin_home   -0.004480   0.001622  -2.763  0.00573 ** 
## cumulative_margin_away    0.001785   0.001518   1.176  0.23959    
## overall_rating_mean_home -0.039281   0.006433  -6.106 1.02e-09 ***
## overall_rating_mean_away  0.054050   0.006260   8.635  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3886.2  on 3039  degrees of freedom
## Residual deviance: 3571.6  on 3035  degrees of freedom
## AIC: 8367.6
## 
## Number of Fisher Scoring iterations: 5
## Confusion Matrix and Statistics
## 
##        
##         FALSE TRUE
##   FALSE   684  255
##   TRUE    966 1135
##                                           
##                Accuracy : 0.5984          
##                  95% CI : (0.5807, 0.6158)
##     No Information Rate : 0.5428          
##     P-Value [Acc > NIR] : 3.635e-10       
##                                           
##                   Kappa : 0.2221          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4145          
##             Specificity : 0.8165          
##          Pos Pred Value : 0.7284          
##          Neg Pred Value : 0.5402          
##              Prevalence : 0.5428          
##          Detection Rate : 0.2250          
##    Detection Prevalence : 0.3089          
##       Balanced Accuracy : 0.6155          
##                                           
##        'Positive' Class : FALSE           
## 
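
The slides do not show how the two Poisson fits were turned into the home-win predictions scored above. One plausible rule, sketched here on simulated stand-in data, is to predict a home win whenever the home side's expected goals exceed the away side's. Both the rule and the shortened column names are assumptions:

```r
set.seed(1)

# Simulated stand-in data; not the EPL dataset.
n <- 800
sim <- data.frame(
  rating_home = rnorm(n, 75, 4),
  rating_away = rnorm(n, 75, 4)
)
gap <- sim$rating_home - sim$rating_away
sim$home_goals <- rpois(n, exp(0.4 + 0.04 * gap))
sim$away_goals <- rpois(n, exp(0.1 - 0.04 * gap))

# One Poisson regression per side, as in the two summaries above.
home_fit <- glm(home_goals ~ rating_home + rating_away,
                family = poisson(), data = sim)
away_fit <- glm(away_goals ~ rating_home + rating_away,
                family = poisson(), data = sim)

# type = "response" gives the expected goal count (the Poisson mean).
lambda_home <- predict(home_fit, sim, type = "response")
lambda_away <- predict(away_fit, sim, type = "response")

# Assumed decision rule: home win if the home side is expected to score more.
pred_home_win <- lambda_home > lambda_away
table(predicted = pred_home_win, actual = sim$home_goals > sim$away_goals)
```

A rule like this never predicts a draw, and home advantage pushes most games toward a predicted home win, which would fit the large `TRUE` row in the matrix above.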

Future

Our data ended up being very high dimensional. We could explore methods of reducing the dimensionality of our dataset, such as PCA.
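
A minimal sketch of what that could look like with base R's `prcomp`, on simulated stand-in attribute columns (not the real player-attribute means):

```r
set.seed(1)

# Simulated correlated attributes driven by one latent "skill" factor,
# plus one unrelated column. Illustrative only.
n <- 300
skill <- rnorm(n)
attrs <- data.frame(
  crossing = 60 + 5 * skill + rnorm(n),
  finishing = 58 + 4 * skill + rnorm(n),
  dribbling = 62 + 6 * skill + rnorm(n),
  stamina = 70 + rnorm(n, sd = 3)
)

# Centre and scale so no attribute dominates purely because of its units.
pca <- prcomp(attrs, center = TRUE, scale. = TRUE)

# Share of total variance captured by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(var_explained)

# Component scores that could replace the raw columns as predictors.
scores <- pca$x[, 1:2]
```

The cumulative variance curve is the usual guide for how many components to keep before refitting the models on `scores` instead of the raw attributes.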

Packages used

Data manipulation:

* dplyr
* knitr
* tidyr
* lubridate

Graphics:

* ggplot2
* plotly (interactive one)
* corrplot